similarity kernel
Review for NeurIPS paper: Convergence and Stability of Graph Convolutional Networks on Large Random Graphs
Summary and Contributions: This paper presents theoretical analysis of convergence and stability properties of GCNs on large random graphs. It introduces continuous GCNs (c-GCN) that act on a bounded, piecewise-Lipschitz function of unobserved latent node variables which are linked through a similarity kernel. It has two main contributions. Firstly, it studies notions of invariance and equivariance to isomorphism of random graph models, and gives convergence results of discrete GCNs to c-GCNs for large graphs. Specifically, for the invariant case the authors claim that the outputs of both networks lie in the same output space.
Reviews: Human-in-the-Loop Interpretability Prior
More than that, it depends on the purpose for which an explanation is being desired. Assessing whether a model is fit-for-purpose would entail defining a specific task, which as you state is not something you do in this paper. Nevertheless I think it's an important part of framing the general problem.
Zero Shot Molecular Generation via Similarity Kernels
Elijošius, Rokas, Zills, Fabian, Batatia, Ilyes, Norwood, Sam Walton, Kovács, Dávid Péter, Holm, Christian, Csányi, Gábor
The combinatorial scaling of the available chemical space with molecule size is one of the main challenges in the design of new molecules and materials. Generative modelling aims to solve this by directly proposing structures with desirable properties, without exhaustively enumerating and screening candidates. Recently, diffusion-based models have achieved impressive results in molecular docking [1] and generation of linkers [2], drug-like molecules [3, 4] and crystal structures [5, 6]. Diffusion models are trained to reverse a stochastic noising process, which gradually corrupts samples of training data until they are indistinguishable from samples drawn from an uninformative prior distribution, such as a standard Gaussian [7-9]. [...] Gaussian, an approach known as denoising score matching [10-12]. In the context of molecule generation, the score is closely related to atomic forces. Consider training data that comprise configurations sampled using molecular dynamics or other methods from an underlying Boltzmann distribution, x ∼ exp(−βU(x))/Z. Here, x = {r, z} is a set that represents a molecule, with r the atomic positions and z the chemical elements, U(x) the potential energy, β the inverse temperature, and Z the partition function. In this case, when the elements z are fixed, the score of the data distribution s(x, 0) corresponds to the atomic force (defined as the negative gradient of the potential energy) up to a multiplicative constant:
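The relation between the score and the atomic force described in this excerpt can be written out from the Boltzmann form of the data distribution. The following is a reconstruction from those definitions rather than the paper's own equation; it assumes the score is the gradient of the log-density with respect to the atomic positions r at fixed elements z, and the symbols p and F are introduced here only for illustration.

```latex
p(x) = \frac{\exp\!\left(-\beta U(x)\right)}{Z}
\quad\Longrightarrow\quad
s(x, 0) = \nabla_{r} \log p(x) = -\beta\, \nabla_{r} U(x) = \beta\, F(x),
\qquad F(x) := -\nabla_{r} U(x).
```

The partition function Z does not depend on r, so it drops out of the gradient, and the multiplicative constant is the inverse temperature β.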
SBSM-Pro: Support Bio-sequence Machine for Proteins
Wang, Yizheng, Zhai, Yixiao, Ding, Yijie, Zou, Quan
Bio-sequences, which include DNA, RNA, and proteins, are the molecular foundation of modern genetic research. The classification of bio-sequences based on sequence information has been a key focus in bioinformatics research. At present, with the sequential completion of genome mapping from humans to various species, we have amassed a vast amount of sequence data, creating an urgent need for computer-assisted annotation of sequence functions. Although it is statistically evident that genetic sequences determine hereditary diseases, the mechanisms by which sequence variations contribute to diseases are intricately complex. It is difficult to address and interpret all these issues through one biological experiment; hence, multiple computer predictions are needed to guide the progression of wet lab exploration. In summary, the application of information science and machine learning to bio-sequence classification is a valuable tool for assisting researchers in comprehending and analysing bio-sequences. It serves as a key driving force for advancing research in the field of bioinformatics. In the field of bio-sequence classification, machine learning methods are broadly pursued using two strategies: feature extraction combined with traditional classification methods and direct sequence classification via deep learning techniques. For bio-sequences, relevant features are mainly characterized as frequency, physicochemical, structural, and evolutionary features.
t-METASET: Tailoring Property Bias of Large-Scale Metamaterial Datasets through Active Learning
Lee, Doksoo, Chan, Yu-Chin, Chen, Wei Wayne, Wang, Liwei, van Beek, Anton, Chen, Wei
Inspired by the recent achievements of machine learning in diverse domains, data-driven metamaterials design has emerged as a compelling paradigm that can unlock the potential of multiscale architectures. The model-centric research trend, however, lacks principled frameworks dedicated to data acquisition, whose quality propagates into the downstream tasks. Often built by naive space-filling design in shape descriptor space, metamaterial datasets suffer from property distributions that are either highly imbalanced or at odds with design tasks of interest. To this end, we present t-METASET: an active-learning-based data acquisition framework aiming to guide both diverse and task-aware data generation. Distinctly, we seek a solution to a commonplace yet frequently overlooked scenario at early stages of data-driven design of metamaterials: when a massive (~O(10^4)) shape-only library has been prepared with no properties evaluated. The key idea is to harness a data-driven shape descriptor learned from generative models, fit a sparse regressor as a start-up agent, and leverage metrics related to diversity to drive data acquisition to areas that help designers fulfill design goals. We validate the proposed framework in three deployment cases, which encompass general use, task-specific use, and tailorable use. Two large-scale mechanical metamaterial datasets are used to demonstrate the efficacy. Applicable to general image-based design representations, t-METASET could boost future advancements in data-driven design.
Multi-Time Attention Networks for Irregularly Sampled Time Series
Shukla, Satya Narayan, Marlin, Benjamin M.
Irregular sampling occurs in many time series modeling applications where it presents a significant challenge to standard deep learning models. This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. In this paper, we propose a new deep learning framework for this setting that we call Multi-Time Attention Networks. Multi-Time Attention Networks learn an embedding of continuous time values and use an attention mechanism to produce a fixed-length representation of a time series containing a variable number of observations. We investigate the performance of our framework on interpolation and classification tasks using multiple datasets. Our results show that our approach performs as well as or better than a range of baseline and recently proposed models while offering significantly faster training times than current state-of-the-art methods. Irregularly sampled time series occur in applications including healthcare, climate science, ecology, astronomy, biology and others. It is well understood that irregular sampling poses a significant challenge to machine learning models, which typically assume fully-observed, fixed-size feature representations (Yadav et al., 2018). While recurrent neural networks (RNNs) have been widely used to model such data because of their ability to handle variable length sequences, basic RNNs assume regular spacing between observation times as well as alignment of the time points where observations occur for different variables (i.e., fully-observed vectors). In practice, both of these assumptions can fail to hold for real-world sparse and irregularly observed time series.
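The idea in the abstract above, embedding continuous time values and attending from a fixed set of reference times to a variable number of observations, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the embedding form (one linear component plus sinusoids), the untrained random parameters `freqs` and `phases`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def time_embedding(t, freqs, phases):
    """Map scalar times to vectors: the raw time plus sinusoidal components.

    freqs and phases stand in for parameters that would be learned in practice.
    """
    t = np.asarray(t, dtype=float)[:, None]           # shape (n, 1)
    return np.concatenate([t, np.sin(t * freqs + phases)], axis=1)

def time_attention(ref_times, obs_times, obs_values, freqs, phases):
    """Softmax attention from reference times (queries) to observed times (keys).

    Returns one row per reference time, so the output size does not depend
    on how many observations the series contains.
    """
    q = time_embedding(ref_times, freqs, phases)
    k = time_embedding(obs_times, freqs, phases)
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # rows sum to 1
    return weights @ obs_values

rng = np.random.default_rng(0)
freqs, phases = rng.normal(size=7), rng.normal(size=7)
ref_times = np.linspace(0.0, 1.0, 4)                   # fixed reference grid

# Two series with different numbers of irregularly spaced observations.
for n_obs in (3, 11):
    obs_times = np.sort(rng.uniform(0.0, 1.0, n_obs))
    obs_values = rng.normal(size=(n_obs, 2))
    rep = time_attention(ref_times, obs_times, obs_values, freqs, phases)
    print(n_obs, rep.shape)
```

Both series yield a representation with one row per reference time, which is the fixed-length property the abstract refers to.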
METASET: Exploring Shape and Property Spaces for Data-Driven Metamaterials Design
Chan, Yu-Chin, Ahmed, Faez, Wang, Liwei, Chen, Wei
Data-driven design of mechanical metamaterials is an increasingly popular method to combat costly physical simulations and immense, often intractable, geometrical design spaces. Using a precomputed dataset of unit cells, a multiscale structure can be quickly filled via combinatorial search algorithms, and machine learning models can be trained to accelerate the process. However, the dependence on data induces a unique challenge: An imbalanced dataset containing more of certain shapes or physical properties can be detrimental to the efficacy of data-driven approaches. In answer, we posit that a smaller yet diverse set of unit cells leads to scalable search and unbiased learning. To select such subsets, we propose METASET, a methodology that 1) uses similarity metrics and positive semi-definite kernels to jointly measure the closeness of unit cells in both shape and property spaces, and 2) incorporates Determinantal Point Processes for efficient subset selection. Moreover, METASET allows the trade-off between shape and property diversity so that subsets can be tuned for various applications. Through the design of 2D metamaterials with target displacement profiles, we demonstrate that smaller, diverse subsets can indeed improve the search process as well as structural performance. By eliminating inherent overlaps in a dataset of 3D unit cells created with symmetry rules, we also illustrate that our flexible method can distill unique subsets regardless of the metric employed. Our diverse subsets are provided publicly for use by any designer.
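The two ingredients named in the abstract above, a positive semi-definite kernel over shape and property spaces and a Determinantal Point Process for picking a diverse subset, can be sketched as follows. This is a minimal illustration under stated assumptions, not the METASET code: the Gaussian kernel, the weighted sum controlled by a hypothetical trade-off parameter `w`, and the greedy log-determinant selection (a common stand-in for DPP MAP inference) are all choices made here for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian similarity kernel; positive semi-definite by construction."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def joint_kernel(shape_feats, prop_feats, w=0.5, gamma=1.0):
    """Assumed trade-off: weighted sum of shape and property kernels (0 <= w <= 1)."""
    return w * rbf_kernel(shape_feats, gamma) + (1.0 - w) * rbf_kernel(prop_feats, gamma)

def greedy_dpp(L, k):
    """Greedily add the item that maximizes log det of the selected submatrix of L."""
    selected = []
    remaining = list(range(L.shape[0]))
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:        # no candidate keeps the determinant positive
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected

# Toy data: items 0/1 and 2/3 are near-duplicates, item 4 is distinct.
shape_feats = np.array([[0.0, 0.0], [0.0, 0.01], [5.0, 5.0], [5.0, 5.01], [10.0, 0.0]])
prop_feats = np.array([[1.0], [1.0], [2.0], [2.0], [3.0]])
L = joint_kernel(shape_feats, prop_feats, w=0.5)
print(greedy_dpp(L, 3))
```

Because the determinant of a submatrix shrinks toward zero when two selected items are nearly identical, the greedy step avoids picking both members of a near-duplicate pair. Moving `w` toward 1 weights shape diversity more heavily, and moving it toward 0 favors property diversity.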
Similarity Kernel and Clustering via Random Projection Forests
Yan, Donghui, Gu, Songxiang, Xu, Ying, Qin, Zhiwei
Similarity plays a fundamental role in many areas, including data mining, machine learning, statistics and various applied domains. Inspired by the success of ensemble methods and the flexibility of trees, we propose to learn a similarity kernel called rpf-kernel through random projection forests (rpForests). Our theoretical analysis reveals a highly desirable property of rpf-kernel: far-away (dissimilar) points have a low similarity value while nearby (similar) points have a high similarity, and the similarities have a native interpretation as the probability of points remaining in the same leaf nodes during the growth of rpForests. The learned rpf-kernel leads to an effective clustering algorithm--rpfCluster. On a wide variety of real and benchmark datasets, rpfCluster compares favorably to K-means clustering, spectral clustering and a state-of-the-art clustering ensemble algorithm--Cluster Forests. Our approach is simple to implement and readily adapts to the geometry of the underlying data. Given its desirable theoretical property and competitive empirical performance when applied to clustering, we expect rpf-kernel to be applicable to many problems of an unsupervised nature or as a regularizer in some supervised or weakly supervised settings.
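The interpretation given in the abstract above, similarity as the probability that two points stay in the same leaf, suggests a simple empirical estimate: grow many random projection trees and count how often each pair shares a leaf. The sketch below illustrates that idea only. The split rule (a uniformly random point along a random Gaussian direction), the `min_leaf` stopping rule, and the function names are assumptions, not details confirmed by the abstract.

```python
import numpy as np

def tree_leaf_labels(X, rng, min_leaf=5):
    """Grow one random projection tree and return a leaf id for every point."""
    n = X.shape[0]
    labels = np.empty(n, dtype=int)
    next_leaf = 0
    stack = [np.arange(n)]
    while stack:
        idx = stack.pop()
        # Project the node's points onto a random direction.
        proj = X[idx] @ rng.normal(size=X.shape[1])
        lo, hi = proj.min(), proj.max()
        left = proj <= rng.uniform(lo, hi) if hi > lo else np.ones(len(idx), dtype=bool)
        # Stop on small nodes or when the split fails to separate anything.
        if len(idx) <= min_leaf or left.all() or not left.any():
            labels[idx] = next_leaf
            next_leaf += 1
            continue
        stack.append(idx[left])
        stack.append(idx[~left])
    return labels

def rpf_kernel(X, n_trees=100, min_leaf=5, seed=0):
    """Fraction of trees in which each pair of points ends up in the same leaf."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_trees):
        labels = tree_leaf_labels(X, rng, min_leaf)
        K += labels[:, None] == labels[None, :]
    return K / n_trees

# Two well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(10, 2)), rng.normal(size=(10, 2)) + 100.0])
K = rpf_kernel(X, n_trees=50, min_leaf=5, seed=1)
print(K[:10, :10].mean(), K[:10, 10:].mean())
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to any clustering method that accepts a precomputed affinity matrix, for example spectral clustering.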